Abstract

We give an algorithm for calculating the maximum entropy state 
as the least fixed point of a Scott continuous mapping on the domain 
of classical states in their Bayesian order. 

1 Introduction 

Suppose an experiment has one of $n$ different possible outcomes. These
outcomes define a function $a : \{1,\ldots,n\} \to \mathbb{R}$ we sometimes call an
observable. Now suppose in addition that we know that if this experiment
were performed repeatedly, the average outcome $\langle a \rangle$ would be $E$. For
any number of reasons we can imagine wanting to determine a distribution
$x \in \Delta^n$ with $\langle a|x \rangle = E$ that is a good candidate for the 'actual
probabilities.' As an example, if we have a six sided die, and we know the average roll
is 3.5, then we could all agree that the 'best distribution' which matches the
information we have is $(1/6, \ldots, 1/6)$, i.e., all sides are equally likely. But
what if we know the die is biased in some way, so that the average
$\langle a \rangle$ is $E \neq 3.5$?

The difficulty mathematically is that we only have two equations ($x \in \Delta^n$
and $\langle a|x \rangle = E$) but we are trying to solve for $n$ unknowns. The
maximum entropy principle offers an approach to this problem: it says we
should choose the unique $x \in \Delta^n$ with $\langle a|x \rangle = E$ whose entropy $\sigma x$ is
maximum. Later we will prove that the problem has a unique solution
and show how to find it (because a complete detailed proof of this well known
result either does not exist in the literature or is very good at hiding).

The maximum entropy principle originates from statistical mechanics,
where it was known at least as far back as Gibbs that the Boltzmann
distribution is the unique distribution which maximizes the entropy while
simultaneously yielding the observed average energy ($\propto$ temperature).
But the development which appears to have led Jaynes in [7] to regard the
maximum entropy principle as a legitimate form of inference was Shannon's
information theory and its characterization of entropy as the unique
function satisfying three axioms that a measure of uncertainty might obey.
Jaynes offered a very different explanation of statistical mechanics:
it has a physical part, which serves to enumerate the states of a system
and their properties, and it has an inferential aspect, where conclusions are
drawn on the basis of incomplete information by (a) accepting entropy as
the canonical measure of uncertainty in a distribution, partly because of
Shannon's theorem, and (b) applying the maximum entropy principle. A
benefit of this viewpoint is that the nonphysical assumptions required by
traditional approaches to the subject (assumptions not derivable from the
laws of motion) are replaced by a simple inference principle and we can
still derive the usual computational rules in statistical mechanics, such as
the Boltzmann distribution [7]. Since Jaynes' endorsement of the idea [7],
the maximum entropy principle has been successfully applied to problems
in many disciplines, including spectral analysis [3], image reconstruction,
and somewhat recently natural language processing [2].

There are many instances in the literature where it is necessary to calculate
the maximum entropy state, and the methods we have seen look to be
rather geared toward the application at hand. It is known that calculating
this state amounts to being able to solve the equation

$$E \cdot \sum_{i=1}^n e^{\lambda a_i} = \sum_{i=1}^n a_i e^{\lambda a_i}$$

for the Lagrangian multiplier $\lambda$. But we have not seen any methods which
have been proven to always work. (For instance, one could ask, does Newton's
method work, and if so, for which initial guesses? We don't know the
answer to this question by the way, but we suspect that it always does.) In
this paper we give a method for calculating the maximum entropy state that
always applies, we prove that it works starting from any initial guess, and
that it is really a domain theoretic idea in disguise: the maximum entropy
state is the least fixed point of a Scott continuous map on the domain of
classical states in their Bayesian order.

The heart of the method makes use of basic techniques from numerical
analysis. What requires work is to realize that these basic techniques are
perfectly suited for solving this nontrivial equation. This realization is only
possible if one is persistent enough to continue manipulating some really
dreadful sums until they take just the right form. All the arithmetic pays
off in the end because there are some really neat domain theoretic ideas
hidden beneath the surface useful for calculating the Lagrangian multiplier
that defines the maximum entropy state.

2 The equilibrium state 

For an integer $n \ge 2$, the classical states are

$$\Delta^n := \left\{ x \in [0,1]^n : \sum_{i=1}^n x_i = 1 \right\}.$$

A classical state $x \in \Delta^n$ is pure when $x_i = 1$ for some $i \in \{1, \ldots, n\}$; we
denote such a state by $e_i$. Pure states $\{e_i\}_i$ are the actual states a system
can be in, while general mixed states $x$ and $y$ are epistemic entities.

Given a vector $a : \{1, \ldots, n\} \to \mathbb{R}$, sometimes called an observable, its
average value is

$$\langle a|x \rangle = \sum_{i=1}^n a_i x_i$$

defined for $x \in \Delta^n$. Shannon entropy $\sigma : \Delta^n \to [0, \infty)$ is

$$\sigma x = -\sum_{i=1}^n x_i \log x_i.$$

These ideas combine to give the existence and uniqueness of the equilibrium
state associated to (energy) observable $a$ in thermodynamics.

Lemma 2.1 If $a : \{1,\ldots,n\} \to \mathbb{R}$ is a vector, there is a unique classical
state $y \in \Delta^n$ such that

$$\langle a|y \rangle - \sigma y = \inf\{\langle a|x \rangle - \sigma x : x \in \Delta^n\}.$$

The state $y$ is given pointwise by $y_i = e^{-a_i}/Z_a$ and satisfies

$$\langle a|y \rangle - \sigma y = -\log Z_a$$

where

$$Z_a := \sum_{i=1}^n \frac{1}{e^{a_i}}.$$

Proof. First, arithmetic gives $\langle a|y \rangle - \sigma y = -\log Z_a$. Next, it is the
minimum value of $f(x) = \langle a|x \rangle - \sigma x$ on $\Delta^n$:

$$
\begin{aligned}
f(x) &= f(x) + \log Z_a - \log Z_a \\
&= f(x) + \sum_{i=1}^n (x_i \log Z_a) - \log Z_a \\
&= -\sum_{x_i > 0} \log\left(\frac{y_i}{x_i}\right) x_i - \log Z_a \\
&\ge \sum_{x_i > 0} \left(1 - \frac{y_i}{x_i}\right) x_i - \log Z_a \qquad \text{(using } \log x \le x - 1 \text{ for } x > 0\text{)} \\
&= \left(1 - \sum_{x_i > 0} y_i\right) - \log Z_a \\
&\ge -\log Z_a.
\end{aligned}
$$

Finally, $y$ is the unique state where $f$ takes its minimum: If $f(x) = -\log Z_a$,
then the string of inequalities above implies

$$-\sum_{x_i > 0} \log\left(\frac{y_i}{x_i}\right) x_i = \sum_{x_i > 0} \left(1 - \frac{y_i}{x_i}\right) x_i$$

which can be rewritten as

$$\sum_{x_i > 0} (t_i - 1 - \log t_i)\, x_i = 0$$

where $t_i = y_i/x_i$. Because $\log x \le x - 1$ for $x > 0$, this is a sum of
nonnegative terms which results in zero. Then each term must be zero, so $t_i = 1$,
which means $x_i = y_i$ whenever $x_i > 0$. Summing over these indices gives
$\sum_{x_i > 0} y_i = \sum_{x_i > 0} x_i = 1$. However, since $\sum_i y_i = 1$ and each
$y_i > 0$, we must have $x_i > 0$ for all $i \in \{1, \ldots, n\}$. Then $x = y$. □
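Lemma 2.1 lends itself to a quick numerical sanity check. The sketch below is only an illustration under stated assumptions: the observable `a` is a made-up example, the helper names (`free_energy`, `random_state`, and so on) are hypothetical and not from the paper, and random sampling can of course only probe the inequality, not prove it.

```python
import math
import random

def entropy(x):
    """Shannon entropy, with the convention 0 * log 0 = 0."""
    return -sum(p * math.log(p) for p in x if p > 0)

def expectation(a, x):
    """Average value <a|x> of the observable a in the state x."""
    return sum(ai * xi for ai, xi in zip(a, x))

def equilibrium_state(a):
    """Return (y, Z_a) with y_i = exp(-a_i) / Z_a, as in Lemma 2.1."""
    weights = [math.exp(-ai) for ai in a]
    z = sum(weights)
    return [w / z for w in weights], z

def free_energy(a, x):
    """f(x) = <a|x> - entropy(x), the quantity minimised in Lemma 2.1."""
    return expectation(a, x) - entropy(x)

def random_state(n, rng):
    """A random classical state, used only to probe the inequality."""
    raw = [rng.random() for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

a = [0.3, 1.0, 2.5, 4.0]                    # hypothetical observable
y, z = equilibrium_state(a)
rng = random.Random(0)
samples = [random_state(len(a), rng) for _ in range(1000)]
# Smallest observed excess of f over its claimed minimum f(y).
gap = min(free_energy(a, x) for x in samples) - free_energy(a, y)
```

Here `free_energy(a, y)` should agree with `-math.log(z)` up to rounding, and `gap` should be positive because no sampled state beats the equilibrium state.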



3 The maximum entropy principle 

For the rest of the paper, we fix an observable $a : \{1,\ldots,n\} \to \mathbb{R}$ with
$a_i < a_{i+1}$. If we know the probability $x_i$ that outcome $a_i$ will occur, then
we have a distribution $x \in \Delta^n$ which can be used to calculate the average
value $\langle a \rangle$ of $a$:

$$\langle a \rangle = \langle a|x \rangle = \sum_{i=1}^n a_i x_i.$$

But suppose all we know is that $\langle a \rangle = E$. What distribution should we
attribute this average to? The maximum entropy principle says we should
choose the $x \in \Delta^n$ with $\langle a|x \rangle = E$ whose entropy $\sigma x$ is as large as possible.
Then we seek to maximize $\sigma x$ subject to the constraints

$$\sum_{i=1}^n x_i = 1 \quad \text{and} \quad \langle a|x \rangle = \sum_{i=1}^n a_i x_i = E.$$

Let us think about this carefully. First, assuming for the moment that
$S = \langle a|\cdot \rangle^{-1}(\{E\}) \neq \emptyset$, the problem has a solution because $\sigma$ is continuous
on a nonempty closed subset $S$ of the compact set $\Delta^n$. Call this solution
$x$. If there were another solution $y \neq x$, then it too would satisfy $\langle a|y \rangle = E$
and $\sigma y = \sigma x$. But then $z = (x + y)/2 \in \Delta^n$ would have $\langle a|z \rangle = E$ and by
the strict concavity of entropy,

$$\sigma(z) > (1/2)\sigma x + (1/2)\sigma y = \sigma x,$$

which contradicts the fact that $x$ maximizes the entropy. So the problem
has a unique solution and we call it the maximum entropy state.

In every reference we have seen in the literature, the next step has been to
apply Lagrangian multipliers to determine a candidate for the maximum of
$\sigma$ on $S$. But the method of Lagrangian multipliers only applies to regions where all partial
derivatives of $\sigma$ (and the two constraints) exist, and then it is only capable
of detecting extrema that occur at interior points of such a region. The
partial derivatives of $\sigma$ do not exist on the boundary of $\Delta^n$. So if we want a
guarantee that Lagrangian multipliers will yield the maximum entropy state,
we need to know that the maximum entropy state does not occur on the
boundary of $\Delta^n$. From the point of view of a pure optimization problem, it
is not clear why the maximum should always be taken on the interior of $\Delta^n$.
Though it proves to be largely technical, we still should make the following
point: it is mathematically incorrect to apply Lagrangian multipliers in this
situation and then assert that it always yields the maximum entropy state.

As an example, if $E = a_1$, then $e_1$ is the maximum entropy state, which
lies along the boundary. For $E = a_n$, the maximum entropy state is $e_n$.
Ignoring this for a moment, a suspect application of Lagrangian multipliers
suggests that the maximum entropy state $y$ is given by

$$y_i = \frac{e^{\lambda a_i}}{\sum_{j=1}^n e^{\lambda a_j}}$$

where $\lambda \in \mathbb{R}$ satisfies

$$E \cdot \sum_{i=1}^n e^{\lambda a_i} = \sum_{i=1}^n a_i e^{\lambda a_i}.$$

At this stage, we have no way of knowing which one of the following is true:



(i) The maximum entropy state occurs on the boundary, in which case it
is not the state $y$, so Lagrangian multipliers is of no use in finding it,

(ii) The maximum entropy state occurs at an interior point, in which case
the equation for $\lambda$ has at least one solution, and one of these solutions
will yield the maximum entropy state. But which one?

Though $\lambda$ may not even exist, it turns out that there are only two cases
where it fails to exist: $E = a_1$ and $E = a_n$. Otherwise, $\lambda$ exists uniquely, and
defines the unique state $y$, leaving three possibilities: $y$ is the maximum
entropy state, a minimum, or neither. Thankfully:

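Proposition 3.1 Let $a : \{1,\ldots,n\} \to \mathbb{R}$ be a vector with $a_i < a_{i+1}$ and
$a_1 < E < a_n$. There is a unique $\lambda \in \mathbb{R}$ with

$$E \cdot \sum_{i=1}^n e^{\lambda a_i} = \sum_{i=1}^n a_i e^{\lambda a_i}.$$

The state $y \in \Delta^n$ given by

$$y_i = \frac{e^{\lambda a_i}}{\sum_{j=1}^n e^{\lambda a_j}}$$

satisfies

$$\langle a|y \rangle = E \quad \text{and} \quad \sigma y = \sup\{\sigma x : \langle a|x \rangle = E \ \&\ x \in \Delta^n\}.$$

Thus, $y$ is the only state with these two properties, i.e., it is the maximum
entropy state.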

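Proof. First suppose that a solution $\lambda$ to the equation exists. We will prove
that the associated $y$ has to be the maximum entropy state. To do so, define
a new observable $b = -\lambda a$ by $b_i = -\lambda a_i$. Now take the equilibrium state $y$
associated to $b$ given by

$$y_i = \frac{e^{-b_i}}{Z_b} = \frac{e^{\lambda a_i}}{\sum_{j=1}^n e^{\lambda a_j}}.$$

By Lemma 2.1 we also have

$$\langle b|y \rangle - \sigma y = \inf\{\langle b|x \rangle - \sigma x : x \in \Delta^n\}.$$

Now we can prove that $y$ is the maximum entropy state. First,

$$\langle a|y \rangle = \sum_{i=1}^n a_i \left(\frac{e^{\lambda a_i}}{Z_b}\right) = \frac{\sum_{i=1}^n a_i e^{\lambda a_i}}{\sum_{i=1}^n e^{\lambda a_i}} = E$$

using our assumption about $\lambda$. Next, let $x$ be any other state with $\langle a|x \rangle = E$.
Then $\langle b|x \rangle = -\lambda E = \langle b|y \rangle$. Thus,

$$-\lambda E - \sigma x = \langle b|x \rangle - \sigma x \ge \langle b|y \rangle - \sigma y = -\lambda E - \sigma y$$

because $y$ is the equilibrium state for $b$. This proves $\sigma y \ge \sigma x$.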

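Now suppose the equation has two solutions $\lambda$ and $\beta$. Then each defines
the maximum entropy state (which we know is unique) so for all $i$,

$$y_i = \frac{e^{\lambda a_i}}{\sum_{j=1}^n e^{\lambda a_j}} = \frac{e^{\beta a_i}}{\sum_{j=1}^n e^{\beta a_j}}.$$

In particular,

$$\frac{y_1}{y_2} = e^{\lambda(a_1 - a_2)} = e^{\beta(a_1 - a_2)}$$

which means $\lambda(a_1 - a_2) = \beta(a_1 - a_2)$, and since $a_1 < a_2$, we get $\lambda = \beta$.
Thus, $\lambda$ is unique assuming it exists.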

Last, we prove that a solution to the equation exists. Define

$$f(x) = \frac{\sum_{i=1}^n a_i e^{x a_i}}{\sum_{i=1}^n e^{x a_i}} - E.$$

Using $a_1 < a_n$, we have

$$\lim_{x \to -\infty} f(x) = a_1 - E \quad \& \quad \lim_{x \to \infty} f(x) = a_n - E.$$

Because $a_1 < E < a_n$, taking $c$ close to $-\infty$ gives $f(c) < 0$ while $d$ close to
$\infty$ gives $f(d) > 0$. The continuity of $f$ yields a $\lambda$ with $f(\lambda) = 0$. □
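Since the existence argument is an intermediate value argument, a bracketing search is one way to approximate $\lambda$ numerically. The sketch below uses plain bisection, a standard technique swapped in here for illustration only; it is not the method this paper develops. The function names and the six sided die with $E = 4.5$ are assumptions chosen for the example.

```python
import math

def mean_value(a, x):
    """Gibbs average sum(a_i e^{x a_i}) / sum(e^{x a_i}).

    The exponent is shifted by its maximum so large |x| cannot overflow.
    """
    shift = max(x * ai for ai in a)
    w = [math.exp(x * ai - shift) for ai in a]
    return sum(ai * wi for ai, wi in zip(a, w)) / sum(w)

def solve_by_bisection(a, E, iterations=200):
    """Bracket the sign change of f(x) = mean_value(a, x) - E, then bisect.

    Assumes a is strictly increasing and a[0] < E < a[-1].
    """
    lo, hi = -1.0, 1.0
    while mean_value(a, lo) >= E:   # push lo left until f(lo) < 0
        lo *= 2.0
    while mean_value(a, hi) <= E:   # push hi right until f(hi) > 0
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mean_value(a, mid) < E:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

die = [1, 2, 3, 4, 5, 6]
lam = solve_by_bisection(die, 4.5)
weights = [math.exp(lam * ai) for ai in die]
state = [w / sum(weights) for w in weights]   # candidate maximum entropy state
```

For an average above 3.5 the computed multiplier is positive, so the resulting state puts more weight on the higher faces.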

For the case $E = a_1$, the only state $x$ with $\langle a|x \rangle = a_1$ is $x = e_1$, so the
maximum entropy state is $e_1$; for $E = a_n$, it is $e_n$. Then

$$(\exists x)\ \langle a|x \rangle = E \iff a_1 \le E \le a_n.$$

So, the maximum entropy state exists if and only if $E \in [a_1, a_n]$. This is
very intuitive because it says that there is a solution iff the expectation lies
between the smallest and largest observable values. Going through a proof
of this well known result in detail provides us with some of the technical
ideas that will be useful in designing an algorithm for actually calculating
the maximum entropy state.

4 An algorithm for calculating $\lambda$



Define

$$f(x) = \frac{\sum_{i=1}^n a_i e^{x a_i}}{\sum_{i=1}^n e^{x a_i}} - E$$

and

$$I_f(x) = x - \frac{f(x)}{(a_n - a_1)^2}$$

for any $x \in \mathbb{R}$.

Lemma 4.1 Let $a_1 < E < a_n$.

(i) For all $x \in \mathbb{R}$, we have $0 < I_f'(x) < 1$.

(ii) The map $I_f$ has a unique fixed point $\lambda$ and $I_f^n(x) \to \lambda$ for each $x \in \mathbb{R}$.

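Proof. (i) To prove $0 < f'(x) < (a_n - a_1)^2$, we first calculate $f'(x)$. This
takes a while if we want to simplify it as much as possible:

$$
\begin{aligned}
f'(x) &= \frac{1}{\left(\sum_i e^{x a_i}\right)^2} \left[ \left(\sum_{i=1}^n a_i^2 e^{x a_i}\right) \left(\sum_{i=1}^n e^{x a_i}\right) - \left(\sum_{i=1}^n a_i e^{x a_i}\right)^2 \right] \\
&= \frac{1}{\left(\sum_i e^{x a_i}\right)^2} \left[ \sum_{1 \le i,j \le n} a_i^2 e^{x a_i} e^{x a_j} - \sum_{1 \le i,j \le n} a_i a_j e^{x a_i} e^{x a_j} \right] \\
&= \frac{1}{\left(\sum_i e^{x a_i}\right)^2} \sum_{1 \le i,j \le n} e^{x a_i} e^{x a_j} (a_i^2 - a_i a_j) \\
&= \frac{1}{\left(\sum_i e^{x a_i}\right)^2} \sum_{1 \le i \ne j \le n} e^{x a_i} e^{x a_j} (a_i^2 - a_i a_j) \\
&= \frac{1}{\left(\sum_i e^{x a_i}\right)^2} \left[ \sum_{1 \le i < j \le n} e^{x a_i} e^{x a_j} (a_i^2 - a_i a_j) + \sum_{1 \le j < i \le n} e^{x a_i} e^{x a_j} (a_i^2 - a_i a_j) \right] \\
&= \frac{1}{\left(\sum_i e^{x a_i}\right)^2} \left[ \sum_{1 \le i < j \le n} e^{x a_i} e^{x a_j} (a_i^2 - a_i a_j) + \sum_{1 \le i < j \le n} e^{x a_i} e^{x a_j} (a_j^2 - a_i a_j) \right] \\
&= \frac{1}{\left(\sum_i e^{x a_i}\right)^2} \sum_{1 \le i < j \le n} e^{x a_i} e^{x a_j} (a_j - a_i)^2.
\end{aligned}
$$

This proves $f'(x) > 0$. To prove $f'(x) < (a_n - a_1)^2$,

$$
\begin{aligned}
f'(x) &= \frac{1}{\left(\sum_i e^{x a_i}\right)^2} \sum_{1 \le i < j \le n} e^{x a_i} e^{x a_j} (a_j - a_i)^2 \\
&\le \frac{(a_n - a_1)^2}{\left(\sum_i e^{x a_i}\right)^2} \left( \sum_{1 \le i < j \le n} e^{x a_i} e^{x a_j} \right) \\
&< (a_n - a_1)^2
\end{aligned}
$$

where the first inequality follows from the fact that $a$ is increasing, and the second
inequality uses

$$\left( \sum_{i=1}^n e^{x a_i} \right)^2 = \sum_{1 \le i,j \le n} e^{x a_i} e^{x a_j} > \sum_{1 \le i < j \le n} e^{x a_i} e^{x a_j}.$$

Then $0 < f'(x) < (a_n - a_1)^2$ which gives $0 < I_f'(x) < 1$.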
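The pairwise-sum expression for $f'(x)$ derived in part (i), namely the sum over $i < j$ of $e^{x a_i} e^{x a_j} (a_j - a_i)^2$ divided by $(\sum_i e^{x a_i})^2$, can be cross-checked against a finite difference. This is only a numerical sanity check under assumed inputs (a six sided die, $E = 4.5$, and a handful of sample points); the function names are hypothetical.

```python
import math

def f(a, E, x):
    """f(x) = sum(a_i e^{x a_i}) / sum(e^{x a_i}) - E, with a stabilising shift."""
    shift = max(x * ai for ai in a)
    w = [math.exp(x * ai - shift) for ai in a]
    return sum(ai * wi for ai, wi in zip(a, w)) / sum(w) - E

def f_prime_pairs(a, x):
    """f'(x) as a sum over pairs i < j, following the closed form in part (i).

    The common shift cancels between numerator and denominator.
    """
    shift = max(x * ai for ai in a)
    w = [math.exp(x * ai - shift) for ai in a]
    n = len(a)
    pair_sum = sum(w[i] * w[j] * (a[j] - a[i]) ** 2
                   for i in range(n) for j in range(i + 1, n))
    return pair_sum / sum(w) ** 2

def central_difference(a, E, x, h=1e-5):
    """Symmetric finite-difference approximation of f'(x)."""
    return (f(a, E, x + h) - f(a, E, x - h)) / (2 * h)

die = [1, 2, 3, 4, 5, 6]
points = [-3.0, -1.0, 0.0, 0.5, 2.0]
# Largest disagreement between the closed form and the finite difference.
worst = max(abs(f_prime_pairs(die, x) - central_difference(die, 4.5, x))
            for x in points)
bound = (die[-1] - die[0]) ** 2
```

At every sample point the closed form should stay strictly between 0 and `bound`, matching the two inequalities proved above.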

(ii) For the $\lambda$ with $f(\lambda) = 0$ we have $I_f(\lambda) = \lambda$. Given $x \in \mathbb{R}$, set

$$I_x = [\min\{x, \lambda\}, \max\{x, \lambda\}]$$

$$c_x = \sup_{t \in I_x} I_f'(t).$$

Because $I_f'$ is a continuous function on a compact set $I_x$, it assumes its
absolute maximum at some point $t^* \in I_x$, i.e., $c_x = I_f'(t^*)$. This proves
$0 < c_x < 1$. Then by the mean value theorem and the fact that $I_f' > 0$,

$$d(I_f(a), I_f(b)) \le c_x \cdot d(a, b)$$

for all $a, b \in I_x$, where $d$ is the usual metric on $\mathbb{R}$. But $I_f(I_x) \subseteq I_x$ because
$I_f$ maps sets of the form $[x, \lambda]$ and $[\lambda, x]$ to themselves, using the strict
monotonicity of $f$ that follows from $f' > 0$, and the two equivalences

$$x < I_f(x) \equiv x < \lambda \quad \& \quad I_f(x) < x \equiv \lambda < x.$$

Thus, $I_f$ is a contraction on $I_x$, so it has a unique fixed point on $I_x$, which
must be $\lambda$, and $I_f^n(x) \to \lambda$. □

Remark 4.2 $I_f$ is not a contraction because its derivative gets arbitrarily
close to one:

$$\lim_{x \to \infty} I_f'(x) = \lim_{x \to -\infty} I_f'(x) = 1.$$

Any contraction constant $c < 1$ would have to bound its derivative from
above. A forthcoming work will study why 'functions like these' have
canonical fixed points.
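The iteration of Lemma 4.1 can be sketched in a few lines. The stopping rule (a step smaller than `tol`), the iteration cap, and the die example with $E = 4.5$ are assumptions added for the illustration; the lemma itself only asserts convergence of $I_f^n(x)$ for every starting point.

```python
import math

def f(a, E, x):
    """f(x) = sum(a_i e^{x a_i}) / sum(e^{x a_i}) - E, with a stabilising shift."""
    shift = max(x * ai for ai in a)
    w = [math.exp(x * ai - shift) for ai in a]
    return sum(ai * wi for ai, wi in zip(a, w)) / sum(w) - E

def I_f(a, E, x):
    """I_f(x) = x - f(x) / (a_n - a_1)^2, for a sorted in increasing order."""
    return x - f(a, E, x) / (a[-1] - a[0]) ** 2

def iterate_to_lambda(a, E, x0=0.0, tol=1e-12, max_iter=100000):
    """Apply I_f repeatedly from x0 until successive iterates agree to tol."""
    x = x0
    for _ in range(max_iter):
        nxt = I_f(a, E, x)
        if abs(nxt - x) < tol:
            return nxt
        x = nxt
    return x

die = [1, 2, 3, 4, 5, 6]
from_zero = iterate_to_lambda(die, 4.5, x0=0.0)
from_right = iterate_to_lambda(die, 4.5, x0=50.0)
from_left = iterate_to_lambda(die, 4.5, x0=-50.0)
```

All three starting points should land on the same value, in line with part (ii). Far from the fixed point the steps are small because $I_f'$ is close to one there, which is the behaviour Remark 4.2 describes.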

The equivalences

$$x < I_f(x) \equiv x < \lambda \quad \& \quad I_f(x) < x \equiv \lambda < x$$

are important because they allow us to determine properties of $\lambda$ without
actually knowing $\lambda$. For instance, $\lambda > 0$ iff $I_f(0) > 0$. The advantage is that
$I_f(0) > 0$ can be determined computationally, while testing $\lambda > 0$ would
require us to know the value of $\lambda$.
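As a small illustration of this test: since $f(0)$ is the arithmetic mean of the $a_i$ minus $E$, evaluating $I_f(0)$ needs no exponentials at all. The function name below is hypothetical, and the die values are just an example.

```python
def sign_of_lambda(a, E):
    """Return +1, 0 or -1 according to the sign of I_f(0).

    I_f(0) = -f(0) / (a_n - a_1)^2 and f(0) = mean(a) - E, so the sign of
    lambda is decided by comparing E with the plain average of a.
    """
    i_f_zero = -(sum(a) / len(a) - E) / (a[-1] - a[0]) ** 2
    return (i_f_zero > 0) - (i_f_zero < 0)

die = [1, 2, 3, 4, 5, 6]
signs = [sign_of_lambda(die, E) for E in (2.5, 3.5, 4.5)]
```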

5 From bottom to the maximum entropy state 

The calculation of $\lambda$ given in the last section also has a formulation in terms
of classical states: the maximum entropy state is the least fixed point of a
Scott continuous map on $\Delta^n$ in its Bayesian order.

Definition 5.1 A poset $P$ is a partially ordered set. A nonempty subset
$S \subseteq P$ is directed if $(\forall x, y \in S)(\exists z \in S)\ x, y \sqsubseteq z$. The supremum $\bigsqcup S$ of
$S \subseteq P$ is the least of its upper bounds when it exists. A dcpo is a poset in
which every directed set has a supremum.

A function $f : D \to E$ between dcpo's is Scott continuous if it is monotone,

$$(\forall x, y \in D)\ x \sqsubseteq y \Rightarrow f(x) \sqsubseteq f(y),$$

and preserves directed suprema,

$$f\left(\bigsqcup S\right) = \bigsqcup f(S),$$

for all directed $S \subseteq D$. Like complete metric spaces, dcpo's also have a
result which guarantees the existence of canonical fixed points.

Theorem 5.2 Let $D$ be a dcpo with a least element $\bot$. If $f : D \to D$ is a
Scott continuous map, then

$$\mathrm{fix}(f) := \bigsqcup_{n \ge 0} f^n(\bot)$$

is the least fixed point of $f$ on $D$.

The set of classical states $\Delta^n$ has a natural domain theoretic structure,
too many of them in fact. The one of interest in this paper is the Bayesian 
order [I], which we now briefly consider. 

Imagine that one of $n + 1$ different outcomes is possible. If our knowledge of
the outcome is $x \in \Delta^{n+1}$, and then by some means we determine that outcome
$i$ is not possible, our knowledge improves to

$$p_i(x) = \frac{1}{1 - x_i} (x_1, \ldots, \hat{x}_i, \ldots, x_{n+1}) \in \Delta^n,$$

where $p_i(x)$ is obtained by first removing $x_i$ from $x$ and then renormalizing.
The partial mappings which result, $p_i : \Delta^{n+1} \rightharpoonup \Delta^n$ with $\mathrm{dom}(p_i) = \Delta^{n+1} \setminus \{e_i\}$,
are called the Bayesian projections and lead one to the following relation
on classical states.

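Definition 5.3 For $x, y \in \Delta^{n+1}$,

$$x \sqsubseteq y \equiv (\forall i)(x, y \in \mathrm{dom}(p_i) \Rightarrow p_i(x) \sqsubseteq p_i(y)).$$

For $x, y \in \Delta^2$,

$$x \sqsubseteq y \equiv (y_1 \le x_1 \le 1/2) \text{ or } (1/2 \le x_1 \le y_1).$$

The relation $\sqsubseteq$ on $\Delta^n$ is called the Bayesian order.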

The Bayesian order was invented in [4] where the following is proven:

Theorem 5.4 $(\Delta^n, \sqsubseteq)$ is a dcpo with least element $\bot := (1/n, \ldots, 1/n)$ and
$\max(\Delta^n) = \{e_i : 1 \le i \le n\}$.

The Bayesian order has a more direct description: the symmetric
formulation. Let $S(n)$ denote the group of permutations on $\{1, \ldots, n\}$ and

$$\Lambda^n := \{x \in \Delta^n : (\forall i < n)\ x_i \ge x_{i+1}\}$$

denote the collection of monotone decreasing classical states.

Theorem 5.5 For $x, y \in \Delta^n$, we have $x \sqsubseteq y$ iff there is a permutation
$\sigma \in S(n)$ such that $x \cdot \sigma, y \cdot \sigma \in \Lambda^n$ and

$$(x \cdot \sigma)_i (y \cdot \sigma)_{i+1} \le (x \cdot \sigma)_{i+1} (y \cdot \sigma)_i$$

for all $i$ with $1 \le i < n$.

Thus, $(\Delta^n, \sqsubseteq)$ can be thought of as $n!$ many copies of the domain $(\Lambda^n, \sqsubseteq)$
identified along their common boundaries, where $(\Lambda^n, \sqsubseteq)$ is

$$x \sqsubseteq y \equiv (\forall i < n)\ x_i y_{i+1} \le x_{i+1} y_i.$$

It should be remarked though that the problems of ordering $\Lambda^n$ and $\Delta^n$
are very different, with the latter being far more challenging, especially if
one also wants to consider quantum mixed states. Now to the fixed point
theorem.
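The symmetric characterization of Theorem 5.5 translates directly into a brute-force test for small $n$. The sketch below tries every permutation, so it costs $n!$ steps and is meant only as an executable reading of the theorem, not an efficient procedure. The function names are hypothetical, and exact rational arithmetic is used to avoid rounding questions in the comparisons.

```python
from fractions import Fraction
from itertools import permutations

def is_decreasing(v):
    """True when v_1 >= v_2 >= ... >= v_n."""
    return all(v[i] >= v[i + 1] for i in range(len(v) - 1))

def bayesian_leq(x, y):
    """x below y in the Bayesian order, via the symmetric formulation.

    Looks for one permutation that sorts both states into decreasing order
    and satisfies x_i * y_{i+1} <= x_{i+1} * y_i for consecutive entries.
    """
    n = len(x)
    for perm in permutations(range(n)):
        xs = [x[k] for k in perm]
        ys = [y[k] for k in perm]
        if not (is_decreasing(xs) and is_decreasing(ys)):
            continue
        if all(xs[i] * ys[i + 1] <= xs[i + 1] * ys[i] for i in range(n - 1)):
            return True
    return False

F = Fraction
bottom = [F(1, 3), F(1, 3), F(1, 3)]
x = [F(1, 2), F(1, 3), F(1, 6)]
e1 = [F(1), F(0), F(0)]
reversed_x = [F(1, 6), F(1, 3), F(1, 2)]

checks = {
    "bottom below x": bayesian_leq(bottom, x),
    "x below e1": bayesian_leq(x, e1),
    "e1 below x": bayesian_leq(e1, x),
    "x below reversed_x": bayesian_leq(x, reversed_x),
}
```

The last check fails because no single permutation sorts both `x` and `reversed_x` into decreasing order, so the two states are incomparable.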

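Definition 5.6 Define $\lambda : \Delta^n \to \mathbb{R} \cup \{\pm\infty\}$ by

$$
\lambda(x) =
\begin{cases}
\dfrac{\log\left(\dfrac{\mathrm{sort}(x)_1}{\mathrm{sort}(x)_2}\right)}{a_n - a_{n-1}} & \text{if } I_f(0) > 0; \\[3ex]
\dfrac{\log\left(\dfrac{\mathrm{sort}(x)_1}{\mathrm{sort}(x)_2}\right)}{a_1 - a_2} & \text{otherwise,}
\end{cases}
$$

with the understanding for pure states that $\lambda x = \infty$ in the first case and
$\lambda x = -\infty$ in the other. The map sort puts states into decreasing order.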

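Lemma 5.7 For a function $a : \{1, \ldots, n\} \to \mathbb{R}$ with $a_i < a_{i+1}$,

(i) If $\lambda > 0$, then

$$(\forall x, y \in \Delta^n)\ x \sqsubseteq y \Rightarrow \lambda x \le \lambda y$$

in the Bayesian order on $\Delta^n$.

(ii) If $\lambda < 0$, then

$$(\forall x, y \in \Delta^n)\ x \sqsubseteq y \Rightarrow \lambda x \ge \lambda y$$

in the Bayesian order on $\Delta^n$.

That is, the sign of the fixed point $\lambda = I_f(\lambda)$ determines whether $\lambda : \Delta^n \to \mathbb{R}$
is monotone increasing ($\lambda > 0$) or monotone decreasing ($\lambda < 0$).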

Theorem 5.8 Let $a_1 < E < a_n$. The map $\phi : \Delta^n \to \Delta^n$ given by

$$\phi(x) = \left(e^{I_f(\lambda x) a_1}, \ldots, e^{I_f(\lambda x) a_n}\right) \cdot \frac{1}{Z(x)}, \qquad Z(x) = \sum_{i=1}^n e^{I_f(\lambda x) a_i},$$

is Scott continuous in the Bayesian order. Its least fixed point is the
maximum entropy state.

Proof. Let $x \sqsubseteq y$ in the Bayesian order. If $y \in \max(\Delta^n)$, then since
$\lambda y = \pm\infty$, the intent of the definition is a limit. The state $\phi y$ is either $e_n$
or $e_1$. It follows that $\phi x \sqsubseteq \phi y$. Assume now that $y \notin \max(\Delta^n)$.

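Now suppose $\lambda > 0$. Then $0 = \lambda\bot \le \lambda x \le \lambda y$ so $0 < I_f(0) \le I_f(\lambda x) \le I_f(\lambda y)$.
Then $\phi(x)$ and $\phi(y)$ are increasing states and we get

$$
\begin{aligned}
\phi(x) \sqsubseteq \phi(y) &\iff (\forall\, 1 \le i < n)\ I_f(\lambda x)(a_{i+1} - a_i) \le I_f(\lambda y)(a_{i+1} - a_i) \\
&\iff I_f(\lambda x) \le I_f(\lambda y) \\
&\iff \lambda x \le \lambda y
\end{aligned}
$$

and this is true since $\lambda > 0$ implies $\lambda : \Delta^n \to \mathbb{R}$ is monotone increasing. For
$\lambda < 0$, $\lambda y \le \lambda x \le \lambda\bot = 0$, so $I_f(\lambda y) \le I_f(\lambda x) \le I_f(0) < 0$. Then $\phi(x)$ and
$\phi(y)$ are decreasing states and we get

$$
\begin{aligned}
\phi(x) \sqsubseteq \phi(y) &\iff (\forall\, 1 \le i < n)\ I_f(\lambda x)(a_i - a_{i+1}) \le I_f(\lambda y)(a_i - a_{i+1}) \\
&\iff I_f(\lambda x) \ge I_f(\lambda y) \\
&\iff \lambda x \ge \lambda y
\end{aligned}
$$

and this is true since $\lambda < 0$ implies $\lambda : \Delta^n \to \mathbb{R}$ is monotone decreasing. This
proves $\phi$ is monotone. It is Scott continuous because it is Euclidean
continuous and suprema in the Bayesian order are pointwise Euclidean limits.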
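Finally,

$$\lambda(\phi x) = I_f(\lambda x)$$

and so by induction

$$\lambda(\phi^n x) = I_f^n(\lambda x).$$

Then its least fixed point $\mathrm{fix}(\phi)$ must satisfy

$$\lambda(\mathrm{fix}(\phi)) = \lambda\left(\bigsqcup \phi^n(\bot)\right) = \lambda\left(\lim_{n \to \infty} \phi^n(\bot)\right) = \lim_{n \to \infty} \lambda(\phi^n(\bot)) = \lim_{n \to \infty} I_f^n(0) = \lambda$$

which gives

$$\mathrm{fix}(\phi) = \phi(\mathrm{fix}(\phi)) = \left(e^{\lambda a_1}, \ldots, e^{\lambda a_n}\right) \cdot \frac{1}{\sum_{i=1}^n e^{\lambda a_i}},$$

the maximum entropy state. □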
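The iteration $\phi^n(\bot)$ from Theorem 5.8 can be tried out numerically. The sketch below makes several assumptions that should be read as such: the map $\lambda$ on states is taken to divide $\log(\mathrm{sort}(x)_1/\mathrm{sort}(x)_2)$ by $a_n - a_{n-1}$ when $I_f(0) > 0$ and by $a_1 - a_2$ otherwise; pure states are not handled, since there the definition is a limit; a fixed number of steps replaces the supremum; and all names are hypothetical.

```python
import math

def gibbs(a, mu):
    """State proportional to e^{mu a_i}, with a shift for numerical stability."""
    shift = max(mu * ai for ai in a)
    w = [math.exp(mu * ai - shift) for ai in a]
    z = sum(w)
    return [wi / z for wi in w]

def expectation(a, x):
    """Average value <a|x>."""
    return sum(ai * xi for ai, xi in zip(a, x))

def I_f(a, E, x):
    """I_f(x) = x - f(x) / (a_n - a_1)^2 with f(x) = <a|gibbs(x)> - E."""
    return x - (expectation(a, gibbs(a, x)) - E) / (a[-1] - a[0]) ** 2

def lam_of_state(a, E, x):
    """The map lambda on states, as assumed in the lead-in (non-pure x only)."""
    s = sorted(x, reverse=True)
    if s[1] == 0:
        raise ValueError("pure state: the limiting case is omitted in this sketch")
    log_ratio = math.log(s[0] / s[1])
    if I_f(a, E, 0.0) > 0:
        return log_ratio / (a[-1] - a[-2])
    return log_ratio / (a[0] - a[1])

def phi(a, E, x):
    """One application of phi: move to the Gibbs state with parameter I_f(lambda x)."""
    return gibbs(a, I_f(a, E, lam_of_state(a, E, x)))

def least_fixed_point(a, E, steps=2000):
    """Iterate phi from bottom = (1/n, ..., 1/n) a fixed number of times."""
    x = [1.0 / len(a)] * len(a)
    for _ in range(steps):
        x = phi(a, E, x)
    return x

die = [1, 2, 3, 4, 5, 6]
high = least_fixed_point(die, 4.5)   # a case with lambda > 0
low = least_fixed_point(die, 2.5)    # a case with lambda < 0
```

In both runs the limit should reproduce the prescribed average, with `high` increasing along the faces and `low` decreasing.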

The map $\phi$ is not monotone with respect to majorization $(\Lambda^n, \sqsubseteq)$. To
see why, take a problem where $\lambda = 0$, then $I_f(0) = 0$, which means $\lambda x \le 0$.
Then $\phi : \Lambda^n \to \Lambda^n$. Let $x = (1/2, 2/5, 1/10)$, $y = (1/2, 1/2, 0)$. Then $x \sqsubseteq y$
in majorization. Because $\lambda x < 0$, $I_f(\lambda x) < 0$, so $\phi(x) \neq \bot$. However,
$\phi(y) = \bot$, which means $\phi(x) \not\sqsubseteq \phi(y)$ in majorization. It is not immediately
clear whether $\phi$ is monotone in the implicative order. Notice though that
Lemma 5.7 is also valid for the implicative order.
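This counterexample can be replayed numerically. The sketch assumes the observable $a = (1, 2, 3)$ with $E = 2$, which gives $I_f(0) = 0$; it takes majorization to mean that the partial sums of the decreasingly sorted first state never exceed those of the second; and it uses the "otherwise" branch of the map $\lambda$, dividing $\log(\mathrm{sort}(x)_1/\mathrm{sort}(x)_2)$ by $a_1 - a_2$. All function names are hypothetical.

```python
import math

def mean_value(a, x):
    """Gibbs average sum(a_i e^{x a_i}) / sum(e^{x a_i}), shifted for stability."""
    shift = max(x * ai for ai in a)
    w = [math.exp(x * ai - shift) for ai in a]
    return sum(ai * wi for ai, wi in zip(a, w)) / sum(w)

def gibbs(a, mu):
    """State proportional to e^{mu a_i}."""
    shift = max(mu * ai for ai in a)
    w = [math.exp(mu * ai - shift) for ai in a]
    return [wi / sum(w) for wi in w]

def I_f(a, E, x):
    """I_f(x) = x - f(x) / (a_n - a_1)^2 with f(x) = mean_value(a, x) - E."""
    return x - (mean_value(a, x) - E) / (a[-1] - a[0]) ** 2

def lam_of_state(a, E, x):
    """Map lambda on non-pure states, with the branch chosen by the sign of I_f(0)."""
    s = sorted(x, reverse=True)
    denom = (a[-1] - a[-2]) if I_f(a, E, 0.0) > 0 else (a[0] - a[1])
    return math.log(s[0] / s[1]) / denom

def phi(a, E, x):
    return gibbs(a, I_f(a, E, lam_of_state(a, E, x)))

def majorized_by(u, v, tol=1e-12):
    """Partial sums of sorted u never exceed those of sorted v (up to tol)."""
    pu = pv = 0.0
    for ui, vi in zip(sorted(u, reverse=True), sorted(v, reverse=True)):
        pu += ui
        pv += vi
        if pu > pv + tol:
            return False
    return True

a, E = [1, 2, 3], 2.0          # assumed problem with lambda = 0
x = [0.5, 0.4, 0.1]
y = [0.5, 0.5, 0.0]
phi_x, phi_y = phi(a, E, x), phi(a, E, y)
```

Here `majorized_by(x, y)` holds while `majorized_by(phi_x, phi_y)` does not, since `phi_y` is the uniform state and `phi_x` is not.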






6 A conspiracy theory 



The Bayesian projections $(p_i)$ used to define the Bayesian order relate to
entropy in a special way:

$$\sigma x = (1 - x_k)\, \sigma p_k(x) + \sigma(x_k, 1 - x_k)$$

for any $k$ with $x_k \neq 1$. This property might imply Shannon's additivity
property or 'the recursion axiom' so that any function satisfying the equation
above and the two other usual axioms has to be entropy to within a constant.
This equation looks like it almost means something. If we knew what it
meant, and we could also use it to establish a plausible link to the Bayesian
order, then we might try to prove this uniqueness.
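The identity relating entropy to the Bayesian projections is easy to check numerically. The state below is an arbitrary example and the helper names are hypothetical; indices are zero-based in the code.

```python
import math

def entropy(x):
    """Shannon entropy, with the convention 0 * log 0 = 0."""
    return -sum(p * math.log(p) for p in x if p > 0)

def bayesian_projection(x, k):
    """Remove the k-th entry of x and renormalise (requires x[k] != 1)."""
    rest = [xi for i, xi in enumerate(x) if i != k]
    return [r / (1.0 - x[k]) for r in rest]

def recursion_rhs(x, k):
    """(1 - x_k) * entropy(p_k(x)) + entropy(x_k, 1 - x_k)."""
    return ((1.0 - x[k]) * entropy(bayesian_projection(x, k))
            + entropy([x[k], 1.0 - x[k]]))

x = [0.1, 0.2, 0.3, 0.4]                      # arbitrary example state
# Difference between the two sides of the identity, for each index k.
errors = [abs(entropy(x) - recursion_rhs(x, k)) for k in range(len(x))]
```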

7 Etc. 

The things in this paper which are original (to the best of our knowledge)
are that $I_f$ always iterates to $\lambda$ and that it can be used to define the Scott
continuous $\phi$ whose least fixed point is the maximum entropy state.

The operator $\phi$ might have a meaningful interpretation: we begin with
$\bot$, and then with each iteration probabilities are adjusted based on the
information ($\langle a|x \rangle = E$) we have, until the limit gives us just the right state.
There should be a logic that captures the type of inference provided by the
maximum entropy principle: states are propositions. Perhaps the logic we
are looking for treats observables as incomplete descriptions of propositions:

(i) Maybe the logic has sequents of the form $I \to q$ where $q$ is a proposition
and $I$ is information which partially describes a proposition. In this
logic, it should be a theorem that $a, E \to \mathrm{fix}(\phi)$. Is it possible to
extract a 'proof' of this theorem from $(\phi^n \bot)$? Or is the theorem
$\mathrm{fix}(\phi) \to \langle a \rangle = E$? What is the logic of statistical mechanics?

(ii) The maximum entropy principle and some of its variants might all 
have some underlying qualitative component. Maybe it is possible to 
explain how from a certain kind of logic one can extract a statistical 
inference method. Maybe one has a choice about how to write the 
maximum entropy principle: either in the language of expectations, 
entropy and optimization techniques, or as a logic that will probably 
annoy a lot of people. 

(iii) Maybe $\phi$ has an informatic derivative at $\mathrm{fix}(\phi)$.